This project is part of the Google Data Analytics Capstone: Case Study #1, and it will follow Google’s 6 steps of the data analysis process: ask, prepare, process, analyze, share and act. RStudio/Kaggle will be used throughout the data collection, analysis and visualization processes.
Cyclistic is a fictional bike-share company in Chicago with a fleet of over 5,800 bicycles and 600 docking stations. It offers both classic bicycles and pedal-assist e-bikes. Customers can pay by one of three methods: a single-ride pass, a full-day pass or an annual membership.
The aim of this case study, as Director of Marketing Lily Moreno states, is to increase profits by converting single-ride and full-day customers (both known as casual riders) into annual members.
Though Cyclistic is an imaginary company, real data will be obtained
through an actual Chicago-based bike-sharing service called Divvy.
During the Ask phase, the first step is to define the problem being solved:
How do we convert current casual Cyclistic customers into annual members?
In order to answer this, Moreno has formulated a few questions to guide the team as they conduct an analysis on historical ride data. The specific question she has assigned me is:
How do annual members and casual riders use Cyclistic bikes differently?
The metrics I intend to use in my analysis to identify trends and gain insights about casual riders and annual members include ride counts and average ride durations, broken down by bike type, month, day of the week, hour of the day, and starting/ending station.
We will be using actual trip history from Divvy, which is made publicly available under their Data License Agreement.
Because data privacy protects the customer’s identity and personal information, useful data such as number of rides per user, age, gender, occupation, and physical address (to differentiate tourists from local residents) cannot be used for our analysis.
Variables provided in this dataset include:
1. ride_id (unique key)
2. rideable_type (type of bike)
3-4. started_at & ended_at (starting and ending times of rental)
5-6. start_station_name & end_station_name (names of starting and ending stations)
7-8. start_station_id & end_station_id (IDs of starting and ending stations)
9-12. start_lat, start_lng, end_lat & end_lng (exact latitude/longitude of start and end of ride)
13. member_casual (whether the customer is single-ride/full-day or has an annual membership)
First, we load the required libraries.
library(tidyverse)
library(readr) # Import CSV files
library(ggmap) # Google Maps API for data visualizations
library(scales) # For ggplot axis scales
library(reshape) # Convert wide data to long data for ggplot
Set the working directory and import the 12 most recent monthly trip histories from CSV format into dataframes.
setwd("./data") # Set working directory
# Import 12 monthly CSVs
trips_2022_06 <- read_csv("202206-divvy-tripdata.csv", show_col_types = FALSE)
trips_2022_07 <- read_csv("202207-divvy-tripdata.csv", show_col_types = FALSE)
trips_2022_08 <- read_csv("202208-divvy-tripdata.csv", show_col_types = FALSE)
trips_2022_09 <- read_csv("202209-divvy-tripdata.csv", show_col_types = FALSE)
trips_2022_10 <- read_csv("202210-divvy-tripdata.csv", show_col_types = FALSE)
trips_2022_11 <- read_csv("202211-divvy-tripdata.csv", show_col_types = FALSE)
trips_2022_12 <- read_csv("202212-divvy-tripdata.csv", show_col_types = FALSE)
trips_2023_01 <- read_csv("202301-divvy-tripdata.csv", show_col_types = FALSE)
trips_2023_02 <- read_csv("202302-divvy-tripdata.csv", show_col_types = FALSE)
trips_2023_03 <- read_csv("202303-divvy-tripdata.csv", show_col_types = FALSE)
trips_2023_04 <- read_csv("202304-divvy-tripdata.csv", show_col_types = FALSE)
trips_2023_05 <- read_csv("202305-divvy-tripdata.csv", show_col_types = FALSE)
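As an aside, the twelve near-identical read_csv() calls (and the later rbind() step) can be collapsed into one pipeline. A minimal sketch, not the code used in this analysis, assuming the CSVs in the working directory keep the "YYYYMM-divvy-tripdata.csv" naming:

```r
library(readr)
library(purrr)

# Find every monthly trip file in the working directory and
# read + row-bind them in a single call
csv_files <- list.files(pattern = "-divvy-tripdata\\.csv$")
trips_combined <- map_dfr(csv_files, read_csv, show_col_types = FALSE)
```

This only works safely because, as verified below, every monthly file shares the same column names and types.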
Now that all 12 months are imported, we verify that the first and last months have the same number of columns, and that each column has the same name and type.
spec(trips_2022_06) # First month
## cols(
## ride_id = col_character(),
## rideable_type = col_character(),
## started_at = col_datetime(format = ""),
## ended_at = col_datetime(format = ""),
## start_station_name = col_character(),
## start_station_id = col_character(),
## end_station_name = col_character(),
## end_station_id = col_character(),
## start_lat = col_double(),
## start_lng = col_double(),
## end_lat = col_double(),
## end_lng = col_double(),
## member_casual = col_character()
## )
spec(trips_2023_05) # Last month
## cols(
## ride_id = col_character(),
## rideable_type = col_character(),
## started_at = col_datetime(format = ""),
## ended_at = col_datetime(format = ""),
## start_station_name = col_character(),
## start_station_id = col_character(),
## end_station_name = col_character(),
## end_station_id = col_character(),
## start_lat = col_double(),
## start_lng = col_double(),
## end_lat = col_double(),
## end_lng = col_double(),
## member_casual = col_character()
## )
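Instead of eyeballing the two spec() printouts, the comparison can also be done programmatically. A minimal sketch, using toy dataframes in place of the monthly imports:

```r
# Toy frames standing in for trips_2022_06 and trips_2023_05
df_first <- data.frame(ride_id = "a", start_lat = 41.89)
df_last  <- data.frame(ride_id = "b", start_lat = 41.80)

# TRUE only if column names and column types both match
same_schema <- identical(names(df_first), names(df_last)) &&
  identical(sapply(df_first, class), sapply(df_last, class))
same_schema  # TRUE
```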
After confirming that column names and types match, we merge all 12 months into a single dataframe named trips_combined.
trips_combined <- rbind(trips_2022_06, trips_2022_07, trips_2022_08, trips_2022_09,
trips_2022_10, trips_2022_11, trips_2022_12, trips_2023_01,
trips_2023_02, trips_2023_03, trips_2023_04, trips_2023_05)
Finally, once the merge succeeds, remove the 12 monthly dataframes to free up memory in this memory-intensive project.
rm(trips_2022_06, trips_2022_07, trips_2022_08, trips_2022_09, trips_2022_10,
trips_2022_11, trips_2022_12, trips_2023_01, trips_2023_02, trips_2023_03,
trips_2023_04, trips_2023_05)
Now that the entire dataset has been imported, the Process phase
consists of checking and cleaning data. Though data manipulation is
typically done in the Analyze step, I have done some of it here as it
goes hand-in-hand with the cleaning processes.
First, we will check all fields for missing (NA) values.
sum(is.na(trips_combined$ride_id))
## [1] 0
sum(is.na(trips_combined$rideable_type))
## [1] 0
sum(is.na(trips_combined$started_at))
## [1] 0
sum(is.na(trips_combined$ended_at))
## [1] 0
sum(is.na(trips_combined$start_station_name))
## [1] 834545
sum(is.na(trips_combined$end_station_name))
## [1] 891757
sum(is.na(trips_combined$start_station_id))
## [1] 834677
sum(is.na(trips_combined$end_station_id))
## [1] 891898
sum(is.na(trips_combined$start_lat))
## [1] 0
sum(is.na(trips_combined$start_lng))
## [1] 0
sum(is.na(trips_combined$end_lat))
## [1] 5961
sum(is.na(trips_combined$end_lng))
## [1] 5961
sum(is.na(trips_combined$member_casual))
## [1] 0
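The thirteen sum(is.na(...)) calls above can be collapsed into a single expression. A minimal sketch on toy data standing in for trips_combined:

```r
# Toy data with one clean column and one column containing NAs
toy <- data.frame(ride_id = c("a", "b", "c"),
                  end_lat = c(41.9, NA, NA))

# One NA count per column, in a single call
colSums(is.na(toy))
# ride_id: 0, end_lat: 2
```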
From the results, we can see that the following 6 fields have NA values: start_station_name, end_station_name, start_station_id, end_station_id, end_lat, and end_lng.
Regarding start/end station names and IDs, Divvy’s ebike page states, “you can also lock [ebikes] to any other public bike rack, light pole, signpost, or retired parking meter within the service area.”
In other words, ebikes do not have to be returned to or obtained from stations. Let’s verify whether these trips with NULL stations are ebikes:
sum(is.na(trips_combined$start_station_name))
## [1] 834545
sum(is.na(trips_combined$start_station_name) & trips_combined$rideable_type == "electric_bike")
## [1] 834545
Indeed, all trips with NULL start stations are ebikes, so we can safely ignore these fields.
Next, we take a closer look at the trips with NULL end_lat and end_lng values. After saving rides with NULL end coordinates to null_end_coord, I examine the results manually and select the columns that I believe are relevant.
null_end_coord <- filter(trips_combined, is.na(end_lat))
# View selected columns
head(select(null_end_coord, rideable_type, started_at, ended_at, member_casual), n = 10L)
## # A tibble: 10 × 4
## rideable_type started_at ended_at member_casual
## <chr> <dttm> <dttm> <chr>
## 1 classic_bike 2022-06-26 20:26:45 2022-06-27 21:26:37 casual
## 2 classic_bike 2022-06-26 20:43:12 2022-06-27 21:43:07 casual
## 3 docked_bike 2022-06-14 13:56:27 2022-06-16 23:05:05 casual
## 4 classic_bike 2022-06-27 19:08:54 2022-06-28 20:08:50 casual
## 5 classic_bike 2022-06-25 18:59:21 2022-06-26 19:59:16 casual
## 6 classic_bike 2022-06-30 21:35:14 2022-07-01 22:35:07 casual
## 7 classic_bike 2022-06-19 13:03:29 2022-06-20 14:03:24 casual
## 8 classic_bike 2022-06-27 16:18:55 2022-06-28 17:18:44 member
## 9 classic_bike 2022-06-07 23:57:12 2022-06-09 00:57:01 casual
## 10 docked_bike 2022-06-14 18:44:50 2022-06-15 19:44:50 casual
From the first 10 rows (a detailed look at all trips also confirms this), we see that only classic and docked bikes (more about docked bikes later) are listed, and many of these trips are over 24 hours long.
These bikes may have been temporarily lost or stolen, or the trips may be the result of some user/system error. The number of affected trips is also small enough that I believe it is safe to remove them from the dataset.
trips_cleaning <- filter(trips_combined, !is.na(end_lat))
Since ride_id is the dataset’s only unique key, we count the number of duplicated IDs and see that none exist.
sum(duplicated(trips_cleaning$ride_id))
## [1] 0
Calculate trip duration in minutes using started_at and ended_at and store this value in new column ride_duration. Also convert type from difftime to numeric (to avoid ggplot issues).
trips_cleaning <- trips_cleaning %>%
mutate(ride_duration = difftime(ended_at, started_at, units = "mins"))
# Convert from difftime to numeric
trips_cleaning$ride_duration <- as.numeric(trips_cleaning$ride_duration)
Next, check for any negative durations, rides between 0 and 1 minutes, and over 1 day (1440 minutes).
Divvy states that they have removed “any trips that were below 60 seconds in length (potentially false starts or users trying to re-dock a bike to ensure it was secure).” However, this does not appear to be the case, as there are still many entries under 1 minute.
# Negative trip duration
sum(trips_cleaning$ride_duration < 0)
## [1] 112
# Between 0 and 1 minutes
sum(trips_cleaning$ride_duration >= 0 & trips_cleaning$ride_duration < 1)
## [1] 145858
# Over 1 day (1440 minutes)
sum(trips_cleaning$ride_duration > 1440)
## [1] 126
And finally, we clean the data by removing trips under 1 minute (including negative durations) or over 1 day.
trips_cleaning <- subset(trips_cleaning, ride_duration >= 1 & ride_duration <= 1440)
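The effect of this filter can be sanity-checked on toy durations; only rides in the 1–1440 minute range survive:

```r
# Toy durations standing in for trips_cleaning$ride_duration:
# a negative ride, a sub-minute ride, a normal ride, and a multi-day ride
toy <- data.frame(ride_duration = c(-5, 0.5, 30, 2000))

kept <- subset(toy, ride_duration >= 1 & ride_duration <= 1440)
nrow(kept)  # 1 (only the 30-minute ride remains)
```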
Display types of bikes offered.
table(trips_cleaning$rideable_type)
##
## classic_bike docked_bike electric_bike
## 2546228 152996 2977749
From Divvy’s bike page, it is clear what classic and electric bikes are. However, nowhere on its website is a docked bike defined. Let’s create a dataframe for docked bikes only and take a closer look.
docked_bikes <- filter(trips_cleaning, rideable_type == "docked_bike")
head(select(docked_bikes, rideable_type, started_at, ended_at, member_casual), n = 10L)
## # A tibble: 10 × 4
## rideable_type started_at ended_at member_casual
## <chr> <dttm> <dttm> <chr>
## 1 docked_bike 2022-06-13 16:16:58 2022-06-13 17:42:51 casual
## 2 docked_bike 2022-06-20 16:08:27 2022-06-20 16:28:32 casual
## 3 docked_bike 2022-06-18 23:30:08 2022-06-19 00:22:26 casual
## 4 docked_bike 2022-06-19 13:11:13 2022-06-19 13:19:00 casual
## 5 docked_bike 2022-06-14 16:38:02 2022-06-14 17:06:03 casual
## 6 docked_bike 2022-06-09 17:12:03 2022-06-09 17:28:52 casual
## 7 docked_bike 2022-06-01 09:38:42 2022-06-01 10:08:18 casual
## 8 docked_bike 2022-06-22 00:23:25 2022-06-22 00:26:52 casual
## 9 docked_bike 2022-06-13 12:36:49 2022-06-13 13:05:17 casual
## 10 docked_bike 2022-06-02 14:41:29 2022-06-02 15:09:06 casual
table(docked_bikes$member_casual)
##
## casual
## 152996
The value that stands out here is member_casual: every docked-bike trip was taken by a casual customer. But this still does not answer the question of what a docked bike is. Docked bikes comprise roughly 2.7% of all trips, so the dilemma is whether or not to include them in the dataset. I Googled “Divvy ‘docked_bike’ meaning,” and it seems several other people working on this case study ran into the same issue.
Kelly Luu and sanji claim these were bikes removed from circulation, and thus removed them from the dataset. Jeremy Rieunier contacted Divvy, who replied that they had used docked_bike for all classic bikes, then gradually began reclassifying them as classic_bike sometime in 2020.
Browsing through Divvy’s trip repository, it appears all trips through June 2020 were labeled as docked_bike.
In July 2020, electric_bike was added. All others were still docked_bike. The official City of Chicago website confirms the introduction of ebikes late July 2020.
Finally, in December 2020 we see the first use of classic_bike, though still mixed in with docked_bike. It is likely that many bikes still retained the docked_bike name after the migration.
Knowing the above, I will go ahead and rename all docked_bike to classic_bike. However, the question as to why all docked_bike are casual riders remains unanswered.
trips_cleaning$rideable_type[trips_cleaning$rideable_type == "docked_bike"] <- "classic_bike"
table(trips_cleaning$rideable_type)
##
## classic_bike electric_bike
## 2699224 2977749
The final step is to create several more columns for data visualization purposes: the day of the week, month, hour of the day, date, and season in which each rental began. Note that this can also be done in the Analyze phase.
I chose seasons rather than quarters due to similarities in monthly trends. For example, the winter months of December through February had more in common than the Q1 months of January through March.
trips_final <- trips_cleaning %>%
  mutate(day_of_week = wday(started_at, label = TRUE),
         month = month(started_at, label = TRUE),
         hour = hour(started_at),
         date = date(started_at),
         season = case_when(
           month == "Dec" | month == "Jan" | month == "Feb" ~ "Winter (Dec-Feb)",
           month == "Mar" | month == "Apr" | month == "May" ~ "Spring (Mar-May)",
           month == "Jun" | month == "Jul" | month == "Aug" ~ "Summer (Jun-Aug)",
           month == "Sep" | month == "Oct" | month == "Nov" ~ "Fall (Sep-Nov)"))
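The month-to-season mapping can be verified on sample dates. A minimal sketch on toy timestamps (assuming an English locale for month labels); %in% is a compact equivalent of the chained == comparisons used in the case_when above:

```r
library(dplyr)
library(lubridate)

# Toy timestamps standing in for trips_cleaning$started_at
toy <- data.frame(started_at = as.POSIXct(c("2022-07-15 10:00:00",
                                            "2023-01-15 10:00:00")))

toy <- toy %>%
  mutate(month = month(started_at, label = TRUE),
         season = case_when(
           month %in% c("Dec", "Jan", "Feb") ~ "Winter (Dec-Feb)",
           month %in% c("Mar", "Apr", "May") ~ "Spring (Mar-May)",
           month %in% c("Jun", "Jul", "Aug") ~ "Summer (Jun-Aug)",
           month %in% c("Sep", "Oct", "Nov") ~ "Fall (Sep-Nov)"))

toy$season  # "Summer (Jun-Aug)" "Winter (Dec-Feb)"
```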
To prevent ggplot from ordering months, days of the week and season alphabetically on the x-axis, we will manually set the order.
trips_final$month <-
factor(trips_final$month, levels = c("Jun", "Jul", "Aug", "Sep", "Oct",
"Nov", "Dec", "Jan", "Feb", "Mar",
"Apr", "May"))
trips_final$day_of_week <-
factor(trips_final$day_of_week, levels = c("Sun", "Mon", "Tue", "Wed",
"Thu", "Fri", "Sat"))
trips_final$season <-
factor(trips_final$season, levels = c("Winter (Dec-Feb)",
"Spring (Mar-May)",
"Summer (Jun-Aug)",
"Fall (Sep-Nov)"))
Now that we have the finalized version of trip history, ready to be analyzed in the next phase, we can write it to a .CSV file (optionally) and clear all other dataframes from memory.
# Write to .csv file. Uncomment next line as necessary
# write.csv(trips_final, "trips_final.csv", row.names=FALSE)
# Clear memory
rm(docked_bikes, null_end_coord, trips_cleaning, trips_combined)
During the Analyze phase, we can finally take a glimpse at data visualizations to identify trends and relationships between casual riders and members. Each sub-section will consist of code, visualization and an observation about the data.
Since we are working with counts in the millions, ggplot often writes y-axis values in scientific notation. The following code forces it to write the numbers out fully.
# Y-axis values are fully written out
options(scipen = 999)
# Total annual rentals
ggplot(trips_final, aes(fill=member_casual, x=member_casual)) +
geom_bar(position="dodge") + # Non-stacking bars
labs(title = "Total Annual Bike Rentals",
x = "Member Type",
y = "Total Rentals",
caption = "Data from June 2022 - May 2023") +
theme(legend.position = "none") + # Hide legend
scale_y_continuous(labels = comma) # Comma between every 3 digits on y-axis
Observation: For the entire year, member trips
(~3.4 million) outnumber casual (~2.3 million).
# Total rentals by month
ggplot(trips_final, aes(fill=member_casual,x=month)) + geom_bar(position="dodge") +
labs(title = "Monthly Bike Rentals",
x = "Month",
y = "Rentals",
fill = "", # No legend key
caption = "Data from June 2022 - May 2023") +
scale_y_continuous(labels = comma)
Observation: Member trips outnumber casual trips
throughout the entire year. The amount varies by month, however. The gap
narrows greatly during the summer months, and in June and July casual
trips are almost as high as member. Casual trips are at their lowest
point during the winter months.
# Total rentals, by type of bike
ggplot(data = trips_final, mapping = aes(x = rideable_type, fill = member_casual)) +
geom_bar(position = "dodge") +
labs(title = "Total Rentals, by Type of Bike",
x = "Bike Type",
y = "Total Rentals",
fill = "",
caption = "Data from June 2022 - May 2023") +
scale_y_continuous(labels = comma)
Observation: For members, classic bikes and
ebikes are about even. For casual riders, ebikes are 25% more popular
than classic bikes.
# By month and type of bike
trips_final %>% ggplot(aes(fill=member_casual,x=month)) +
geom_bar(position="dodge") +
labs(title = "Monthly Rentals by Type of Bike",
x = "Month",
y = "Rentals",
fill = "",
caption = "Data from June 2022 - May 2023") +
scale_y_continuous(labels = comma) +
facet_wrap(~rideable_type)
Observation: For the most part, monthly trends by
bike type mirror the overall monthly rentals. For ebike trips,
casual riders actually outnumber members during June and July.
# By day of week
ggplot(trips_final, aes(fill=member_casual,x=day_of_week)) + geom_bar(position="dodge") +
labs(title = "Total Bike Rentals by Day of Week",
x = "",
y = "Rentals",
fill = "",
caption = "Data from June 2022 - May 2023") +
scale_y_continuous(labels = comma)
Observation: For members, rentals peak mid-week
while casual trips peak at the weekend.
# By day of week, faceted by season
ggplot(trips_final, aes(fill=member_casual,x=day_of_week)) + geom_bar(position="dodge") +
labs(title = "Total Bike Rentals by Day of Week and Season",
x = "",
y = "Rentals",
fill = "",
caption = "Data from June 2022 - May 2023") +
scale_y_continuous(labels = comma) +
facet_wrap(~season)
Observation: The member mid-week vs. casual weekend
peaks stay true throughout the year, though the total amount of trips
varies greatly by season. The only time casual riders outnumber members
is during summer weekends.
# By hour of day
ggplot(data = trips_final, mapping = aes(x = hour, fill = member_casual)) +
geom_bar(position = "dodge") +
labs(title = "Bike Rentals by Hour of the Day",
x = "Hour",
y = "Bike Rentals",
fill = "",
caption = "By hour of starting rental. Data from June 2022 - May 2023") +
## Label x-axis with 12-hour AM/PM rather than 24-hour format, with 2-hour intervals
scale_x_continuous(breaks = c(0,2,4,6,8,10,12,14,16,18,20,22),
labels = c("12 AM", "2", "4", "6", "8", "10",
"12 PM", "2", "4", "6", "8", "10 PM"))
Observation: For casual riders, there is a steady
rise and fall between the lowest and highest points of 4 AM and 5 PM.
Members also peak at 5 PM, but they also have a smaller, second peak at
8 AM. Members heavily outnumber casuals for most of the waking hours,
while casuals have a slightly higher margin between 11 PM and 3 AM.
# By hour of day and type of bike
trips_final %>%
ggplot(mapping = aes(x = hour, fill = member_casual)) +
geom_bar(position = "dodge") +
labs(title = "Bike Rentals by Hour of the Day and Type of Bike",
x = "Hour",
y = "Bike Rentals",
fill = "",
caption = "By hour of starting rental. Data from June 2022 - May 2023") +
scale_x_continuous(breaks = c(0,3,6,9,12,15,18,21),
labels = c("12 AM", "3", "6", "9",
"12 PM", "3", "6", "9 PM")) +
facet_wrap(~rideable_type)
Observation: Results by hour of the day fall in line
for both casual riders and members. The only minor detail of note
is a small bump in casual ebike rentals at 8 AM.
# By hour, faceted by season
ggplot(data = trips_final, mapping = aes(x = hour, fill = member_casual)) +
geom_bar(position = "dodge") +
labs(title = "Bike Rentals by Hour of the Day and Season",
x = "Hour",
y = "Bike Rentals",
fill = "",
caption = "By hour of starting rental. Data from June 2022 - May 2023") +
## Label x-axis with 12-hour AM/PM rather than 24-hour format, with 2-hour intervals
scale_x_continuous(breaks = c(0,3,6,9,12,15,18,21),
labels = c("12 AM", "3", "6", "9",
"12 PM", "3", "6", "9 PM")) +
facet_wrap(~season)
Observation: Members clearly outnumber casual riders in
winter, spring and fall. In the summer, however, casual trips are higher
in the afternoon between 12 PM and 3 PM, as well as in the late hours of 8
PM to 4 AM.
# Average rental duration, by month
# Create new dataframe for mean ride duration, by month
avg_duration_month <- aggregate(trips_final$ride_duration,
by=list(trips_final$month, trips_final$member_casual), FUN = mean)
# Rename columns
colnames(avg_duration_month) <- c("month", "member_casual", "avg_duration")
# Reorder months from June to May
avg_duration_month$month <-
factor(avg_duration_month$month,
levels = c("Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec",
"Jan", "Feb", "Mar", "Apr", "May"))
# View avg_duration_month
avg_duration_month
## month member_casual avg_duration
## 1 Jun casual 23.77184
## 2 Jul casual 23.57303
## 3 Aug casual 21.83277
## 4 Sep casual 20.36339
## 5 Oct casual 18.77402
## 6 Nov casual 15.76389
## 7 Dec casual 13.64016
## 8 Jan casual 13.87583
## 9 Feb casual 16.34605
## 10 Mar casual 15.61582
## 11 Apr casual 20.95757
## 12 May casual 22.50975
## 13 Jun member 13.92168
## 14 Jul member 13.72298
## 15 Aug member 13.37185
## 16 Sep member 12.90737
## 17 Oct member 11.80415
## 18 Nov member 11.08141
## 19 Dec member 10.60436
## 20 Jan member 10.39944
## 21 Feb member 10.80226
## 22 Mar member 10.56320
## 23 Apr member 11.88030
## 24 May member 12.93428
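As a side note, the base-R aggregate() call can also be written with dplyr’s group_by()/summarise(), which is often easier to read. A sketch on toy data standing in for trips_final:

```r
library(dplyr)

# Toy rides standing in for trips_final
toy <- data.frame(month = c("Jun", "Jun", "Jul"),
                  member_casual = c("casual", "casual", "member"),
                  ride_duration = c(20, 30, 10))

# Mean duration per month x member-type group
avg_toy <- toy %>%
  group_by(month, member_casual) %>%
  summarise(avg_duration = mean(ride_duration), .groups = "drop")
avg_toy
# Jun/casual averages 25 minutes; Jul/member averages 10
```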
avg_duration_month %>% ggplot(aes(x = month, y = avg_duration, color = member_casual)) +
geom_point(size = 2) + geom_line(linewidth = 1, aes(group = member_casual)) +
labs(title = "Average Bike Rental Duration by Month",
x = "Month",
y = "Average Duration (minutes)",
color = "",
caption = "Data from June 2022 - May 2023")
Observation: Surprisingly, casual ride durations
exceed member durations throughout the entire year, despite single rides
being billed by the minute. For members, the average ride duration is
relatively stable, ranging from 10.4 minutes per trip in January to 13.9 minutes in June.
A key observation here is that not only are casual rentals much
higher in the summer (see Monthly Bike Rentals), the average ride is
also much longer, especially in the summer months. December is the
lowest at 13.6 minutes per trip, while in June this number is a whopping
23.8 minutes.
# Average rental duration, by day of week
# Create new dataframe for mean ride duration, by day of week
avg_duration_day <- aggregate(trips_final$ride_duration,
by=list(trips_final$day_of_week,
trips_final$member_casual), FUN = mean)
## Rename columns
colnames(avg_duration_day) <- c("day_of_week", "member_casual", "avg_duration")
## Force day of week order
avg_duration_day$day_of_week <- factor(avg_duration_day$day_of_week,
levels = c("Sun", "Mon", "Tue", "Wed",
"Thu", "Fri", "Sat"))
# View avg_duration_day
avg_duration_day
## day_of_week member_casual avg_duration
## 1 Sun casual 24.49880
## 2 Mon casual 20.99903
## 3 Tue casual 18.98336
## 4 Wed casual 18.36870
## 5 Thu casual 18.89725
## 6 Fri casual 20.35983
## 7 Sat casual 23.79690
## 8 Sun member 13.84187
## 9 Mon member 11.81774
## 10 Tue member 11.94008
## 11 Wed member 11.92348
## 12 Thu member 12.05692
## 13 Fri member 12.31501
## 14 Sat member 13.85365
avg_duration_day %>%
ggplot(aes(x = day_of_week, y = avg_duration, color = member_casual)) +
geom_point(size = 2) +
geom_line(linewidth = 1, aes(group = member_casual)) +
labs(title = "Average Bike Rental Duration by Day of the Week",
x = "Day",
y = "Average Duration (minutes)",
color = "",
caption = "Data from June 2022 - May 2023")
Observation: As with the monthly results, casual trip
durations are longer than member durations. Though the number of member
trips peaks mid-week (see Total Rentals by Day of Week), the inverse is
true in terms of minutes per trip, which is highest at the weekend. Member
durations vary by just a couple of minutes, from 11.8 minutes per trip on
Monday to 13.9 minutes on Saturday.
Casual trips are also longest at the weekend, with durations ranging
from 18.4 minutes on Wednesday to 24.5 minutes on Sunday.
# By starting station
# Create three dataframes (total, casual, member), each listing unique start stations
# Each also has a column counting rentals per station, sorted in descending order
start_stations <- count(trips_final, start_station_name, sort = TRUE)
start_stations_casual <-
count(trips_final %>% filter(member_casual == "casual"),
start_station_name, sort = TRUE)
start_stations_member <-
count(trips_final %>% filter(member_casual == "member"),
start_station_name, sort = TRUE)
# Rename second columns to "total" or member type
colnames(start_stations)[2] <- "total"
colnames(start_stations_casual)[2] <- "casual"
colnames(start_stations_member)[2] <- "member"
# Merge three dataframes into one
start_stations_final <- full_join(start_stations, start_stations_casual,
by = "start_station_name") %>%
full_join(start_stations_member, by = "start_station_name")
# Clear 3 individual start_station dataframes from memory
rm(start_stations, start_stations_casual, start_stations_member)
# View first 20 rows of merged dataframe
head(start_stations_final, n = 20)
## # A tibble: 20 × 4
## start_station_name total casual member
## <chr> <int> <int> <int>
## 1 <NA> 796984 325889 471095
## 2 Streeter Dr & Grand Ave 71433 54486 16947
## 3 DuSable Lake Shore Dr & Monroe St 40096 30954 9142
## 4 Michigan Ave & Oak St 38854 24275 14579
## 5 DuSable Lake Shore Dr & North Blvd 38427 22499 15928
## 6 Wells St & Concord Ln 36959 15549 21410
## 7 Clark St & Elm St 35190 12510 22680
## 8 Kingsbury St & Kinzie St 33794 9081 24713
## 9 Millennium Park 33454 23964 9490
## 10 Theater on the Lake 32141 17754 14387
## 11 Wells St & Elm St 31254 11898 19356
## 12 Broadway & Barry Ave 30356 12002 18354
## 13 Clark St & Armitage Ave 28536 13081 15455
## 14 Clinton St & Washington Blvd 28044 6454 21590
## 15 Wilton Ave & Belmont Ave 27573 11536 16037
## 16 University Ave & 57th St 27289 6279 21010
## 17 Indiana Ave & Roosevelt Rd 27111 13298 13813
## 18 Wabash Ave & Grand Ave 26568 11003 15565
## 19 Clark St & Lincoln Ave 26531 12599 13932
## 20 Clinton St & Madison St 26392 7333 19059
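The count()-then-full_join() pattern used to build this table can be illustrated on toy data: per-segment counts are merged into one table keyed on station name.

```r
library(dplyr)

# Toy trips standing in for trips_final
toy <- data.frame(start_station_name = c("A", "A", "B"),
                  member_casual = c("casual", "member", "casual"))

# Count rentals per station, overall and for casual riders only
totals  <- count(toy, start_station_name, name = "total")
casuals <- count(filter(toy, member_casual == "casual"),
                 start_station_name, name = "casual")

# Merge the two counts on the station name
full_join(totals, casuals, by = "start_station_name")
# A: total 2, casual 1; B: total 1, casual 1
```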
Nearly 800,000 trips do not begin at a station. As previously mentioned, these should only be ebikes. We will verify this:
# Verify that all rentals with NULL starting stations are from ebikes
nrow(trips_final %>% filter(rideable_type == "electric_bike" & is.na(start_station_name)))
## [1] 796984
nrow(trips_final %>% filter(rideable_type == "classic_bike" & is.na(start_station_name)))
## [1] 0
We will go ahead and remove trips with NULL starting station from this dataset.
# Remove rows with NULL starting stations from dataframe
start_stations_final <- start_stations_final %>% filter(!is.na(start_station_name))
To better pinpoint stations with a higher potential to convert casual riders, we will add a column for casual:member ratio.
# Add column for casual:member ratio
start_stations_final <- start_stations_final %>% mutate(cm_ratio = casual / member)
# View first 20 rows of modified dataframe with casual:member ratio
head(start_stations_final, n = 20)
## # A tibble: 20 × 5
## start_station_name total casual member cm_ratio
## <chr> <int> <int> <int> <dbl>
## 1 Streeter Dr & Grand Ave 71433 54486 16947 3.22
## 2 DuSable Lake Shore Dr & Monroe St 40096 30954 9142 3.39
## 3 Michigan Ave & Oak St 38854 24275 14579 1.67
## 4 DuSable Lake Shore Dr & North Blvd 38427 22499 15928 1.41
## 5 Wells St & Concord Ln 36959 15549 21410 0.726
## 6 Clark St & Elm St 35190 12510 22680 0.552
## 7 Kingsbury St & Kinzie St 33794 9081 24713 0.367
## 8 Millennium Park 33454 23964 9490 2.53
## 9 Theater on the Lake 32141 17754 14387 1.23
## 10 Wells St & Elm St 31254 11898 19356 0.615
## 11 Broadway & Barry Ave 30356 12002 18354 0.654
## 12 Clark St & Armitage Ave 28536 13081 15455 0.846
## 13 Clinton St & Washington Blvd 28044 6454 21590 0.299
## 14 Wilton Ave & Belmont Ave 27573 11536 16037 0.719
## 15 University Ave & 57th St 27289 6279 21010 0.299
## 16 Indiana Ave & Roosevelt Rd 27111 13298 13813 0.963
## 17 Wabash Ave & Grand Ave 26568 11003 15565 0.707
## 18 Clark St & Lincoln Ave 26531 12599 13932 0.904
## 19 Clinton St & Madison St 26392 7333 19059 0.385
## 20 Clark St & Wrightwood Ave 26331 10375 15956 0.650
Next, we create a list of stations with more than 10,000 casual trips and a casual:member ratio above 1.0 (both thresholds can be edited as needed).
# Casual rentals and casual:member ratio can be edited as needed
start_stations_top <- start_stations_final %>% filter(casual > 10000 & cm_ratio > 1)
# View list of stations with >10,000 casual rentals and >1.0 casual:member ratio
start_stations_top
## # A tibble: 11 × 5
## start_station_name total casual member cm_ratio
## <chr> <int> <int> <int> <dbl>
## 1 Streeter Dr & Grand Ave 71433 54486 16947 3.22
## 2 DuSable Lake Shore Dr & Monroe St 40096 30954 9142 3.39
## 3 Michigan Ave & Oak St 38854 24275 14579 1.67
## 4 DuSable Lake Shore Dr & North Blvd 38427 22499 15928 1.41
## 5 Millennium Park 33454 23964 9490 2.53
## 6 Theater on the Lake 32141 17754 14387 1.23
## 7 Shedd Aquarium 24127 19394 4733 4.10
## 8 Montrose Harbor 20116 12306 7810 1.58
## 9 Dusable Harbor 19047 13941 5106 2.73
## 10 Michigan Ave & 8th St 17481 10995 6486 1.70
## 11 Adler Planetarium 15773 11169 4604 2.43
Since ggplot works better with long data (as opposed to wide data), we will use the melt function from the reshape library to convert the dataframe to long format.
# Convert from wide data to long data for ggplot purposes
# Total rentals and casual:member columns are no longer needed
start_stations_long <- select(start_stations_top, start_station_name, casual, member)
# fixes dplyr and reshape library conflict
start_stations_long <- as.data.frame(start_stations_long)
start_stations_long <- melt(start_stations_long, id = "start_station_name")
# Rename the columns
colnames(start_stations_long) <- c("start_station_name", "member_casual", "rentals")
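For what it’s worth, tidyr’s pivot_longer() achieves the same wide-to-long conversion without the reshape/dplyr conflict workaround. A sketch on toy data standing in for start_stations_top:

```r
library(tidyr)

# Toy wide-format station counts
toy <- data.frame(start_station_name = c("A", "B"),
                  casual = c(10, 20),
                  member = c(5, 15))

# One row per station x rider-type combination
pivot_longer(toy, cols = c(casual, member),
             names_to = "member_casual", values_to = "rentals")
# 4 rows in long format
```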
# Bar chart visualization, rotated on x-axis.
# Sorted by total rentals in descending order
# By default, ggplot orders the stations by alphabetical street name.
# fct_reorder forces it to sort by value instead
start_stations_long %>%
mutate(start_station_name = fct_reorder(start_station_name, rentals)) %>%
ggplot(aes(x = start_station_name, y = rentals, fill = member_casual)) +
geom_bar(position="dodge", stat="identity") + coord_flip() +
labs(title = "Total Bike Rentals by Starting Station",
subtitle = "Stations with 10,000+ casual rentals and 1.0 casual:member ratio",
caption = "Data from June 2022 - May 2023",
x = "Starting Station", y = "Rentals", fill = "")
Observation: These 11 stations have the highest
number of casual bike rentals. “Streeter Dr & Grand Ave” leads the
pack with over 54,000 casual trips, followed by “DuSable Lake Shore Dr
& Monroe St” with roughly 31,000. Both of these stations also have over 3
times as many casual riders as members, so there could be a lot of
untapped potential to convert these customers.
It should be noted, however, that we have no data on how many of
these casual riders are tourists vs. local residents (and we are only
concerned with the latter). We can consider disregarding stations which
are near a popular tourist attraction.
For comparison purposes, we can repeat the entire process for ending
stations.
# List unique end stations, frequency of each, descending order
end_stations <- count(trips_final, end_station_name, sort = TRUE)
end_stations_casual <- count(trips_final %>%
filter(member_casual == "casual"),
end_station_name, sort = TRUE)
end_stations_member <- count(trips_final %>%
filter(member_casual == "member"),
end_station_name, sort = TRUE)
# Rename second columns
colnames(end_stations)[2] <- "total"
colnames(end_stations_casual)[2] <- "casual"
colnames(end_stations_member)[2] <- "member"
# Merge dataframes
end_stations_final <-
full_join(end_stations, end_stations_casual, by = "end_station_name") %>%
full_join(end_stations_member, by = "end_station_name")
# Add column for casual:member ratio
end_stations_final <- end_stations_final %>% mutate(cm_ratio = casual / member)
# Create list stations with at least 1.0 casual:member ratio and 10000+ rentals
end_stations_top <- end_stations_final %>% filter(casual > 10000 & cm_ratio > 1)
# Convert End Stations data frame to long format (since ggplot does not like wide format)
library(reshape)
end_stations_long <- select(end_stations_top, end_station_name, casual, member)
end_stations_long <- as.data.frame(end_stations_long) ## dplyr and reshape conflict
end_stations_long <- melt(end_stations_long, id = "end_station_name")
colnames(end_stations_long) <- c("end_station_name", "member_casual", "rentals")
# Bar chart visualization, rotated on x-axis.
# Sorted by total rentals in descending order
# By default, ggplot orders the stations by alphabetical street name.
# fct_reorder forces it to sort by value instead
end_stations_long %>%
mutate(end_station_name = fct_reorder(end_station_name, rentals)) %>%
ggplot(aes(x = end_station_name, y = rentals, fill = member_casual)) +
geom_bar(position="dodge", stat="identity") + coord_flip() +
labs(title = "Bike Rentals by Ending Station",
subtitle = "Stations with 10,000+ casual rentals and 1.0 casual:member ratio",
caption = "Data from June 2022 - May 2023",
x = "Ending Station", y = "Rentals", fill = "")
Observation: Comparing the starting and ending
station results, we see the same stations listed, albeit in a
slightly different order. The only station that does not appear on both
lists is Michigan Ave & 8th St, whose 9,631 ending-station
trips fall just below the 10,000 threshold.
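The overlap between the two lists can also be checked programmatically rather than by eye. A minimal sketch using dplyr's anti_join, assuming the start_stations_top and end_stations_top data frames built above:

```r
library(dplyr)
# Stations on the starting list but absent from the ending list
anti_join(start_stations_top, end_stations_top,
          by = c("start_station_name" = "end_station_name")) %>%
  select(start_station_name, casual)
# And the reverse direction: ending-only stations
anti_join(end_stations_top, start_stations_top,
          by = c("end_station_name" = "start_station_name")) %>%
  select(end_station_name, casual)
```

anti_join() keeps only the rows of its first argument that have no match in the second, so each call lists the stations unique to one of the two top-station lists.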
The previous section only considers each station individually, rather than a geographical area which may have several popular stations in close proximity. Targeting these areas could be more cost-effective for the marketing budget than targeting individual stations.
First, let’s take a look at the coordinates of the most popular station with casual riders, “Streeter Dr & Grand Ave.”
# View trips starting from most popular starting station, "Streeter Dr & Grand Ave"
head(trips_final %>%
filter(start_station_name == "Streeter Dr & Grand Ave") %>%
select(start_station_name, start_lng, start_lat), n = 20L)
## # A tibble: 20 × 3
## start_station_name start_lng start_lat
## <chr> <dbl> <dbl>
## 1 Streeter Dr & Grand Ave -87.6 41.9
## 2 Streeter Dr & Grand Ave -87.6 41.9
## 3 Streeter Dr & Grand Ave -87.6 41.9
## 4 Streeter Dr & Grand Ave -87.6 41.9
## 5 Streeter Dr & Grand Ave -87.6 41.9
## 6 Streeter Dr & Grand Ave -87.6 41.9
## 7 Streeter Dr & Grand Ave -87.6 41.9
## 8 Streeter Dr & Grand Ave -87.6 41.9
## 9 Streeter Dr & Grand Ave -87.6 41.9
## 10 Streeter Dr & Grand Ave -87.6 41.9
## 11 Streeter Dr & Grand Ave -87.6 41.9
## 12 Streeter Dr & Grand Ave -87.6 41.9
## 13 Streeter Dr & Grand Ave -87.6 41.9
## 14 Streeter Dr & Grand Ave -87.6 41.9
## 15 Streeter Dr & Grand Ave -87.6 41.9
## 16 Streeter Dr & Grand Ave -87.6 41.9
## 17 Streeter Dr & Grand Ave -87.6 41.9
## 18 Streeter Dr & Grand Ave -87.6 41.9
## 19 Streeter Dr & Grand Ave -87.6 41.9
## 20 Streeter Dr & Grand Ave -87.6 41.9
# Get summary
summary(trips_final %>%
filter(start_station_name == "Streeter Dr & Grand Ave") %>%
select(start_station_name, start_lng, start_lat))
## start_station_name start_lng start_lat
## Length:71433 Min. :-87.81 Min. :41.71
## Class :character 1st Qu.:-87.61 1st Qu.:41.89
## Mode :character Median :-87.61 Median :41.89
## Mean :-87.61 Mean :41.89
## 3rd Qu.:-87.61 3rd Qu.:41.89
## Max. :-87.61 Max. :41.97
We can see that most of the coordinates are the same or very near each other. However, the summary shows the min/max longitude and latitude can be up to 0.2 degrees away from the average. For reference, 0.1 degrees of latitude is about 11.1 km (6.9 mi). Since the 1st and 3rd quartile coordinates are identical, we can use the median longitude and latitude values for the next step.
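The 0.1-degree figure can be sanity-checked with a quick base R calculation, using the Earth's mean radius of roughly 6,371 km:

```r
earth_radius_km <- 6371
# Kilometers spanned by one degree of latitude along a great circle
km_per_degree <- 2 * pi * earth_radius_km / 360
km_per_degree * 0.1   # ~11.1 km
km_per_degree * 0.1 * 0.6214   # ~6.9 mi
```

So a 0.2-degree outlier in the raw coordinates could sit over 20 km from the station itself, which is why a robust summary such as the median is preferable to the mean here.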
We will be using the top stations from start_stations_long and calculating their median latitude and longitude values from trips_final.
# Create dataframe with stations and longitude, then rename columns
# (median rather than mean, to avoid the coordinate outliers noted above)
station_lng <- aggregate(trips_final$start_lng, by = list(trips_final$start_station_name),
FUN = median)
colnames(station_lng) <- c("start_station_name", "lng")
# Create dataframe with stations and latitude, then rename columns
station_lat <- aggregate(trips_final$start_lat, by = list(trips_final$start_station_name),
FUN = median)
colnames(station_lat) <- c("start_station_name", "lat")
# Full join two dataframes
station_coord <- full_join(station_lng, station_lat, by = "start_station_name")
# Left join onto our previous start_stations_top dataframe, adding latitude
start_stations_top_with_coord <- left_join(start_stations_top, station_coord,
by = "start_station_name")
# Sort by number of casual riders, decreasing
start_stations_top_with_coord <-
start_stations_top_with_coord[order(start_stations_top_with_coord$casual,
decreasing = TRUE),]
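Since the "geographical area" idea hinges on stations being close together, pairwise distances can make the clustering concrete. A minimal sketch of the haversine (great-circle) distance in base R, assuming the start_stations_top_with_coord data frame built above (geosphere::distHaversine would be an alternative if that package is available):

```r
# Great-circle distance in km between (lng1, lat1) and vectors of points
haversine_km <- function(lng1, lat1, lng2, lat2, r = 6371) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlng <- (lng2 - lng1) * to_rad
  a <- sin(dlat / 2)^2 +
       cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlng / 2)^2
  2 * r * asin(sqrt(pmin(1, a)))   # pmin guards against rounding past 1
}
# Distance from each top station to the busiest one (row 1 after sorting)
with(start_stations_top_with_coord,
     haversine_km(lng[1], lat[1], lng, lat))
```

Stations within a kilometer or two of the busiest station would be natural candidates for a single area-based marketing push.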
Next, we need to use the ggmap library and its functions to create maps of Chicago.
Note that this process requires registering for a private Google Maps API key, and each query is billed against that key. Though Stamen Maps is a free alternative, I prefer the appearance of Google Maps.
Rather than having to execute the query each time I run this notebook (and racking up charges), I ran the following code chunk once and saved the maps to my local drive for use in future sessions. I set different zoom levels and different Chicago landmarks to center the map on because if I set location to “Chicago,” the entire east half of the map will just be Lake Michigan.
# This entire section only needs to be run once, therefore all lines are commented out
# The following code is required prior to using the get_map() function
# register_google(key = "PRIVATE GOOGLE MAPS API KEY")
# Access various zoom levels (higher number = increased zoom) of Chicago,
# centered on a Chicago landmark
# chicago_11_zoom <- get_map(location = "Garfield Park Conservatory", zoom = 11)
# chicago_12_zoom <- get_map(location = "Old Town Chicago", zoom = 12)
# chicago_13_zoom <- get_map(location = "Quartino Ristorante Chicago", zoom = 13)
# Save maps to local drive
# save(chicago_11_zoom, file = "chicago_11_zoom.RData")
# save(chicago_12_zoom, file = "chicago_12_zoom.RData")
# save(chicago_13_zoom, file = "chicago_13_zoom.RData")
On subsequent sessions, we can just load the maps from the local drive.
setwd("./data") # Set working directory
# Load maps from locally saved files
load(file = "chicago_11_zoom.RData")
load(file = "chicago_12_zoom.RData")
load(file = "chicago_13_zoom.RData")
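The fetch-once-then-load pattern can also be wrapped in a small load-or-fetch helper so the billable query only runs when the cached file is missing. A sketch; it swaps save()/load() for saveRDS()/readRDS() so the returned object can be assigned directly, and assumes register_google() has already been called:

```r
# Load a cached map if present; otherwise query Google Maps and cache it
load_or_fetch_map <- function(file, location, zoom) {
  if (file.exists(file)) {
    # readRDS() returns the object directly, unlike load(),
    # which restores it under its originally saved name
    return(readRDS(file))
  }
  m <- ggmap::get_map(location = location, zoom = zoom)
  saveRDS(m, file)
  m
}
# Example (hypothetical file name):
# chicago_12_zoom <- load_or_fetch_map("chicago_12_zoom.rds", "Old Town Chicago", 12)
```

This removes the need to comment and uncomment the query chunk between sessions.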
Let us take a look at our most zoomed out map with the top starting stations plotted.
ggmap(chicago_11_zoom) +
# Plot a point for each of the top starting stations
# Sized according to number of casual trips
geom_point(data = start_stations_top_with_coord,
mapping = aes(x = lng, y = lat, size = casual), alpha = 0.5)
We can see that most of the stations are in the central area along the lake. Let’s zoom in a bit more, add station name labels, and remove unnecessary x- and y-axis coordinate information.
ggmap(chicago_12_zoom) +
# Plot a point for each of the top starting stations
# Sized according to number of casual trips
geom_point(data = start_stations_top_with_coord,
mapping = aes(x = lng, y = lat, size = casual), alpha = 0.5, color = "red") +
# Add station name as a label. hjust to offset to the right
geom_text(data = start_stations_top_with_coord,
mapping = aes(x = lng, y = lat, label = start_station_name),
size = 2, hjust = -0.1) +
labs(title = "Busiest Stations by Number of Casual Riders",
x = "", y = "", size = "Total casual trips") +
# Remove all x- and y-axis text and ticks
theme(axis.text.x=element_blank(),
axis.ticks.x=element_blank(),
axis.text.y=element_blank(),
axis.ticks.y=element_blank())
And finally, zoom in one more time and re-run the code. Note that our northernmost station, “Montrose Harbor,” falls outside the boundaries of the map, which results in the “missing values” warning.
ggmap(chicago_13_zoom) +
# Plot a point for each of the top starting stations
# Sized according to number of casual trips
geom_point(data = start_stations_top_with_coord,
mapping = aes(x = lng, y = lat, size = casual), alpha = 0.5, color = "red") +
# Add station name as a label. hjust to offset to the right
geom_text(data = start_stations_top_with_coord,
mapping = aes(x = lng, y = lat, label = start_station_name),
size = 2, hjust = -0.1) +
labs(title = "Busiest Stations by Number of Casual Riders",
x = "", y = "", size = "Total casual trips") +
# Remove all x- and y-axis text and ticks
theme(axis.text.x=element_blank(),
axis.ticks.x=element_blank(),
axis.text.y=element_blank(),
axis.ticks.y=element_blank())
## Warning: Removed 1 rows containing missing values (`geom_point()`).
## Warning: Removed 1 rows containing missing values (`geom_text()`).
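The warning can be avoided by dropping stations that fall outside the map's bounding box before plotting. A sketch using the bb attribute that ggmap attaches to its map objects, assuming the chicago_13_zoom map and start_stations_top_with_coord data frame from above:

```r
# ggmap stores the bounding box as a data frame with
# ll.lat / ll.lon (lower-left) and ur.lat / ur.lon (upper-right)
bb <- attr(chicago_13_zoom, "bb")
in_bounds <- subset(start_stations_top_with_coord,
                    lng >= bb$ll.lon & lng <= bb$ur.lon &
                    lat >= bb$ll.lat & lat <= bb$ur.lat)
# Re-running the plot with in_bounds instead of the full data frame
# silences the "Removed 1 rows" warnings
```

Here that would simply exclude “Montrose Harbor,” which the observation below already accounts for.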
Observation: Of our 11 starting stations, 10 are
concentrated near the waterfront between “Shedd Aquarium” in the south
and “Theater on the Lake” in the north. “Montrose Harbor” is a bit
further to the north of this area.
Finally, the presentation concludes with a few recommendations on how to convert casual riders into members, in addition to an Appendix with links to the original Divvy dataset and this R Markdown file.